Multilingual and Cross-Lingual Complex Word Identification
نویسندگان
چکیده
Complex Word Identification (CWI) is an important task in lexical simplification and text accessibility. Due to the lack of CWI datasets, previous works largely depend on Simple English Wikipedia and edit histories for obtaining ‘gold standard’ annotations, which are of mixed quality, and limited to English only. We collect complex words/phrases (CP) for English, German and Spanish, annotated by both native and non-native speakers, and propose language independent features that can be used to train multilingual and crosslingual CWI models. We show that the performance of cross-lingual CWI systems (using a model trained on one language and applying it on the other languages) is comparable to the performance of monolingual CWI systems.
منابع مشابه
Providing Cross-Lingual Information Access with Knowledge-Poor Methods
We are proposing a simple, but efficient approach for a number of multilingual and cross-lingual language technology applications that are not limited to the usual two or three languages, but that can be applied with relatively little effort to larger sets of languages. The approach consists of using existing multilingual linguistic resources such as thesauri, nomenclatures and gazetteers, as w...
متن کاملUHD: Cross-Lingual Word Sense Disambiguation Using Multilingual Co-Occurrence Graphs
We describe the University of Heidelberg (UHD) system for the Cross-Lingual Word Sense Disambiguation SemEval-2010 task (CL-WSD). The system performs CLWSD by applying graph algorithms previously developed for monolingual Word Sense Disambiguation to multilingual cooccurrence graphs. UHD has participated in the BEST and out-of-five (OOF) evaluations and ranked among the most competitive systems...
متن کاملSemEval-2010 Task 3: Cross-Lingual Word Sense Disambiguation
We propose a multilingual unsupervised Word Sense Disambiguation (WSD) task for a sample of English nouns. Instead of providing manually sensetagged examples for each sense of a polysemous noun, our sense inventory is built up on the basis of the Europarl parallel corpus. The multilingual setup involves the translations of a given English polysemous noun in five supported languages, viz. Dutch,...
متن کاملConceptNet at SemEval-2017 Task 2: Extending Word Embeddings with Multilingual Relational Knowledge
This paper describes Luminoso’s participation in SemEval 2017 Task 2, “Multilingual and Cross-lingual Semantic Word Similarity”, with a system based on ConceptNet. ConceptNet is an open, multilingual knowledge graph that focuses on general knowledge that relates the meanings of words and phrases. Our submission to SemEval was an update of previous work that builds high-quality, multilingual wor...
متن کاملProbabilistic topic modeling in multilingual settings: An overview of its methodology and applications
Probabilistic topic models are unsupervised generative models which model document content as a two-step generation process, that is, documents are observed as mixtures of latent concepts or topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested into transferring the probabilistic topic modeling concept from monolingua...
متن کامل